 nar model




DYNAC: Dynamic Vocabulary based Non-Autoregressive Contextualization for Speech Recognition

Sudo, Yui, Fukumoto, Yosuke, Shakeel, Muhammad, Peng, Yifan, Lin, Chyi-Jiunn, Watanabe, Shinji

arXiv.org Artificial Intelligence

Contextual biasing (CB) improves automatic speech recognition for rare and unseen phrases. Recent studies have introduced dynamic vocabulary, which represents context phrases as expandable tokens in autoregressive (AR) models. This method improves CB accuracy but slows inference. While dynamic vocabulary can be applied to non-autoregressive (NAR) models, such as connectionist temporal classification (CTC), the conditional independence assumption fails to capture dependencies between static and dynamic tokens. This paper proposes DYNAC (Dynamic Vocabulary-based NAR Contextualization), a self-conditioned CTC method that integrates dynamic vocabulary into intermediate layers. By conditioning the encoder on dynamic vocabulary, DYNAC effectively captures dependencies between static and dynamic tokens while reducing the real-time factor (RTF). Experimental results show that DYNAC reduces RTF by 81% with a 0.1-point degradation in word error rate on the LibriSpeech 960 test-clean set.
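
The conditional independence assumption mentioned above can be illustrated with plain greedy CTC decoding, where each frame picks its label without looking at its neighbours. The following is a minimal sketch of standard CTC decoding, not the DYNAC method itself; the blank symbol `_`, the toy vocabulary, and the function names are assumptions made for illustration.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol; real systems use a reserved token id


def ctc_collapse(frame_labels, blank=BLANK):
    """Collapse a per-frame label path: merge adjacent repeats, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


def greedy_ctc_decode(frame_scores, vocab, blank=BLANK):
    """Pick the best label for each frame independently, then collapse the path.

    frame_scores: one list of scores per frame, aligned with `vocab`.
    Each frame is decided on its own, which is the conditional
    independence assumption of CTC.
    """
    path = []
    for scores in frame_scores:
        best_index = max(range(len(scores)), key=scores.__getitem__)
        path.append(vocab[best_index])
    return ctc_collapse(path, blank)


if __name__ == "__main__":
    vocab = [BLANK, "a", "b"]
    scores = [
        [0.1, 0.8, 0.1],    # frame 1 -> "a"
        [0.1, 0.8, 0.1],    # frame 2 -> "a" (repeat, merged)
        [0.9, 0.05, 0.05],  # frame 3 -> blank
        [0.1, 0.1, 0.8],    # frame 4 -> "b"
    ]
    print(greedy_ctc_decode(scores, vocab))
```

Because all frames are scored in one pass, decoding cost does not grow with the number of output tokens, which is why NAR models of this kind tend to have a lower RTF than AR decoders.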


Algorithm-Informed Graph Neural Networks for Leakage Detection and Localization in Water Distribution Networks

Zhang, Zepeng, Fink, Olga

arXiv.org Artificial Intelligence

Detecting and localizing leakages is a significant challenge for the efficient and sustainable management of water distribution networks (WDN). Leveraging the inherent graph structure of WDNs, recent approaches have used graph-based data-driven methods. However, these methods often learn shortcuts that work well with in-distribution data but fail to generalize to out-of-distribution data. To address this limitation and inspired by the perfect generalization ability of classical algorithms, we propose an algorithm-informed graph neural network (AIGNN). Recognizing that WDNs function as flow networks, incorporating max-flow information can be beneficial for inferring pressures. In the proposed framework, we first train AIGNN to emulate the Ford-Fulkerson algorithm for solving max-flow problems. This algorithmic knowledge is then transferred to address the pressure estimation problem in WDNs. Two AIGNNs are deployed, one to reconstruct pressure based on the current measurements, and another to predict pressure based on previous measurements. Leakages are detected and localized by comparing the outputs of the reconstructor and the predictor. By pretraining AIGNNs to reason like algorithms, they are expected to extract more task-relevant and generalizable features. Experimental results demonstrate that the proposed algorithm-informed approach achieves superior results with better generalization ability compared to GNNs that do not incorporate algorithmic knowledge.
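
For readers unfamiliar with the classical algorithm that the AIGNN is trained to emulate, the following is a minimal sketch of Ford-Fulkerson using depth-first augmenting paths. It shows only the textbook algorithm, not the paper's neural model; the edge-list input format and all names are assumptions made for illustration, and capacities are assumed to be non-negative integers.

```python
from collections import defaultdict


def ford_fulkerson(edges, source, sink):
    """Return the max flow from source to sink.

    edges: iterable of (u, v, capacity) tuples with non-negative integer
    capacities (an assumed input format for this sketch).
    """
    if source == sink:
        raise ValueError("source and sink must differ")

    # Residual capacities; every edge gets a reverse edge starting at 0.
    residual = defaultdict(lambda: defaultdict(int))
    for u, v, capacity in edges:
        residual[u][v] += capacity
        residual[v][u] += 0

    def augment(node, limit, visited):
        """Push up to `limit` units along one DFS path; return the amount pushed."""
        if node == sink:
            return limit
        visited.add(node)
        for neighbor, capacity in list(residual[node].items()):
            if capacity > 0 and neighbor not in visited:
                pushed = augment(neighbor, min(limit, capacity), visited)
                if pushed > 0:
                    residual[node][neighbor] -= pushed
                    residual[neighbor][node] += pushed
                    return pushed
        return 0

    total = 0
    while True:
        pushed = augment(source, float("inf"), set())
        if pushed == 0:
            return total
        total += pushed


if __name__ == "__main__":
    pipes = [("s", "a", 3), ("s", "b", 2), ("a", "b", 1), ("a", "t", 2), ("b", "t", 3)]
    print(ford_fulkerson(pipes, "s", "t"))
```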


Reinforcement Learning for Edit-Based Non-Autoregressive Neural Machine Translation

Wang, Hao, Morimura, Tetsuro, Honda, Ukyo, Kawahara, Daisuke

arXiv.org Artificial Intelligence

Non-autoregressive (NAR) language models are known for their low latency in neural machine translation (NMT). However, a performance gap exists between NAR and autoregressive models due to the large decoding space and difficulty in capturing dependency between target words accurately. Compounding this, preparing appropriate training data for NAR models is a non-trivial task, often exacerbating exposure bias. To address these challenges, we apply reinforcement learning (RL) to Levenshtein Transformer, a representative edit-based NAR model, demonstrating that RL with self-generated data can enhance the performance of edit-based NAR models. We explore two RL approaches: stepwise reward maximization and episodic reward maximization. We discuss the respective pros and cons of these two approaches and empirically verify them. Moreover, we experimentally investigate the impact of temperature setting on performance, confirming the importance of proper temperature setting for NAR models' training.
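
The abstract does not say how temperature enters training. A common reading, assumed here, is that it scales the policy's logits before sampling, so that a lower temperature sharpens the distribution and a higher one flattens it. The sketch below shows that generic mechanism only; the function name and the toy logits are hypothetical and not taken from the paper.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0]  # toy scores for three candidate edit actions
    for temperature in (0.5, 1.0, 2.0):
        probs = softmax_with_temperature(logits, temperature)
        print(temperature, [round(p, 3) for p in probs])
```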


NARRepair: Non-Autoregressive Code Generation Model for Automatic Program Repair

Yang, Zhenyu, Yang, Zhen, Yu, Zhongxing

arXiv.org Artificial Intelligence

With the advancement of deep learning techniques, the performance of Automatic Program Repair (APR) techniques has reached a new level. Previous deep learning-based APR techniques essentially modified program sentences in an autoregressive (AR) manner, which predicts future values based on past values. Because it generates code word by word, the AR-based APR technique suffers from a large inference delay. This drawback hinders the widespread adoption of APR techniques in real-life software development. To address the issue, we aim to apply the non-autoregressive (NAR) method to the APR task, which can output target code in parallel to avoid large inference delays. To effectively adapt the NAR manner for the APR task, in this paper we propose NARRepair, the first customized NAR code generation model for the APR task. NARRepair features three major novelties: 1) using repair actions to alleviate the over-correction issue, 2) extracting dependency information from the AST to alleviate the lack of inter-word dependency information, and 3) employing two-stage decoding to alleviate the lack of contextual information. We evaluated NARRepair on three widely used datasets in the APR community, and the results show that our technique can significantly improve the inference speed while maintaining high repair accuracy.


VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers

Chen, Sanyuan, Liu, Shujie, Zhou, Long, Liu, Yanqing, Tan, Xu, Li, Jinyu, Zhao, Sheng, Qian, Yao, Wei, Furu

arXiv.org Artificial Intelligence

This paper introduces VALL-E 2, the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time. Based on its predecessor, VALL-E, the new iteration introduces two significant enhancements: Repetition Aware Sampling refines the original nucleus sampling process by accounting for token repetition in the decoding history. It not only stabilizes the decoding but also circumvents the infinite loop issue. Grouped Code Modeling organizes codec codes into groups to effectively shorten the sequence length, which not only boosts inference speed but also addresses the challenges of long sequence modeling. Our experiments on the LibriSpeech and VCTK datasets show that VALL-E 2 surpasses previous systems in speech robustness, naturalness, and speaker similarity. It is the first of its kind to reach human parity on these benchmarks. Moreover, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases. The advantages of this work could contribute to valuable endeavors, such as generating speech for individuals with aphasia or people with amyotrophic lateral sclerosis.


Scaling the Vocabulary of Non-autoregressive Models for Efficient Generative Retrieval

Valluri, Ravisri, Mohankumar, Akash Kumar, Dave, Kushal, Singh, Amit, Jiao, Jian, Varma, Manik, Sinha, Gaurav

arXiv.org Artificial Intelligence

Generative Retrieval introduces a new approach to Information Retrieval by reframing it as a constrained generation task, leveraging recent advancements in Autoregressive (AR) language models. However, AR-based Generative Retrieval methods suffer from high inference latency and cost compared to traditional dense retrieval techniques, limiting their practical applicability. This paper investigates fully Non-autoregressive (NAR) language models as a more efficient alternative for generative retrieval. While standard NAR models alleviate latency and cost concerns, they exhibit a significant drop in retrieval performance (compared to AR models) due to their inability to capture dependencies between target tokens. To address this, we question the conventional choice of limiting the target token space to solely words or sub-words. We propose PIXAR, a novel approach that expands the target vocabulary of NAR models to include multi-word entities and common phrases (up to 5 million tokens), thereby reducing token dependencies. PIXAR employs inference optimization strategies to maintain low inference latency despite the significantly larger vocabulary. Our results demonstrate that PIXAR achieves a relative improvement of 31.0% in MRR@10 on MS MARCO and 23.2% in Hits@5 on Natural Questions compared to standard NAR models with similar latency and cost.
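
The two metrics quoted above, MRR@10 and Hits@5, along with the notion of relative improvement, can be computed as in the following sketch. These are the standard definitions rather than the paper's evaluation code; the function names, the input format (one ranked list and one set of relevant ids per query), and the toy data are assumptions.

```python
def mrr_at_k(ranked_lists, relevant_sets, k=10):
    """Mean reciprocal rank of the first relevant item within the top k."""
    if not ranked_lists:
        raise ValueError("need at least one query")
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_lists)


def hits_at_k(ranked_lists, relevant_sets, k=5):
    """Fraction of queries with at least one relevant item in the top k."""
    if not ranked_lists:
        raise ValueError("need at least one query")
    hits = sum(
        1
        for ranked, relevant in zip(ranked_lists, relevant_sets)
        if any(doc_id in relevant for doc_id in ranked[:k])
    )
    return hits / len(ranked_lists)


def relative_improvement(new_score, old_score):
    """Relative gain of new_score over old_score, e.g. 0.31 means +31%."""
    return (new_score - old_score) / old_score


if __name__ == "__main__":
    ranked = [["d1", "d2", "d3"], ["d4", "d5", "d6"]]
    relevant = [{"d2"}, {"d9"}]
    print(mrr_at_k(ranked, relevant), hits_at_k(ranked, relevant))
```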


SpeechAlign: Aligning Speech Generation to Human Preferences

Zhang, Dong, Li, Zhaowei, Li, Shimin, Zhang, Xin, Wang, Pengyu, Zhou, Yaqian, Qiu, Xipeng

arXiv.org Artificial Intelligence

Speech language models have significantly advanced in generating realistic speech, with neural codec language models standing out. However, the integration of human feedback to align speech outputs to human preferences is often neglected. This paper addresses this gap by first analyzing the distribution gap in codec language models, highlighting how it leads to discrepancies between the training and inference phases, which negatively affect performance. Then we explore leveraging learning from human feedback to bridge the distribution gap. We introduce SpeechAlign, an iterative self-improvement strategy that aligns speech language models to human preferences. SpeechAlign involves constructing a preference codec dataset contrasting golden codec tokens against synthetic tokens, followed by preference optimization to improve the codec language model. This cycle of improvement is carried out iteratively to steadily convert weak models to strong ones. Through both subjective and objective evaluations, we show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech language model. Moreover, SpeechAlign exhibits robust generalization capabilities and works for smaller models. Code and models will be available at https://github.com/0nutation/SpeechGPT.


Non-autoregressive Sequence-to-Sequence Vision-Language Models

Shi, Kunyu, Dong, Qi, Goncalves, Luis, Tu, Zhuowen, Soatto, Stefano

arXiv.org Artificial Intelligence

Sequence-to-sequence vision-language models are showing promise, but their applicability is limited by the inference latency of autoregressive prediction. We propose a parallel decoding sequence-to-sequence vision-language model, trained with a Query-CTC loss, that marginalizes over multiple inference paths in the decoder. This allows us to model the joint distribution of tokens, rather than restricting it to conditional distributions as in an autoregressive model. The resulting model, NARVL, achieves performance on par with its state-of-the-art autoregressive counterpart but is faster at inference time, moving from the linear complexity of sequential token generation to constant-time joint inference.